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^L ■ A recently developed linear algebraic method for the computation of perturbation expansion 

Q^ ' coefficients to large order is applied to the problem of a hydrogenic atom in a magnetic field. We take 

^\ , as the zeroth order approximation the D -^ oo limit, where D is the number of spatial dimensions. 

In this pseudoclassical limit, the wavefunction is localized at the minimum of an effective potential 

surface. A perturbation expansion, corresponding to harmonic oscillations about this minimum and 
Q . higher order anharmonic correction terms, is then developed in inverse powers of (D — 1) about this 

y-^ ' limit, to 30th order. To demonstrate the implicit parallelism of this method, which is crucial if it 

^^ is to be successfully applied to problems with many degrees of freedom, we describe and analyze 

a particular implementation on massively parallel Connection Machine systems (CM-2 and CM-5). 

After presenting performance results, we conclude with a discussion of the prospects for extending 

this method to larger systems. 
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I. INTRODUCTION 



^~|^ • The spatial dimension has long been treated as a variable parameter in analyzing critical phenomena and in other 

~~J ' areas of physics.lil However, only in the past ten years has this concept been extensively apolied to atomic and 

i^ molecular systems, particularly to develop dimensional scaling methods for electronic structured The motivation for 

f^ ■ this unconventional approach is that the Schrodinger equation reduces to easily solvable forms in the limits Z? — > 1 

C^t and/or D — > oo. When both limiting solutions are available, interpolation in 1/ D may be used to approximate tha 

a physically meaningful I? = 3 resiilt; this has yielded excellent results for correlation energies of two-electron atomsci 

and for H2 Hartree-Fock energies^ Alternatively, if the D ^ 00 solutions are avaliable for both the problem of interest 
~H ' and a simpler model problem (e.g., Hartree-Fock) for which D = 3 results are easier to calculate, the latter may be 
fj used to "renormalize" some parameter (e.g., nuclear charge). Then the D ^ 00 solution with_thc renormalized 
r* ' parameter may give a good approximation to the D — 3 solution with the actual parameter value.u 
. 5^ ' Our work deals with another widely applicable dimensional scaling method, a perturbation expansion in inverse 

^^ . powers of 13 or a related function, about the solution for the D ^ oo limit. That limit is pseudoclassical and readily 
%-{ ' evaluated, as it reduces to the simple problem of minimizing an effective potential functionfl For large but finite 
. 5t 1 D, the first-order correction accounts for harmonic oscillations about this minimum, and higher-order terms pppvide 
anharmonic corrections. Dimensional perturbation theory has been apolied quite successfully to the grouncB and 
some excited statesQ two-electron atoms and to the hydrogen molecule-ioro using a "moment method" to solve the set 
of perturbation equations. However, this method is not easily extended to larger systems. It also requires a different 
program for each eigenstate, and does not directly provide an expansion for the wavefunction. 

A recently developed linear algebraic method has overcome these shortcomings^ This is conceptually quite simple, 
so can easily be applied to systems with any number of degrees of freedom. It permits calculation of ground and 
excited energy levels using a single program and the wavefunction expansion coefficients are directly obtained in the 
course of computing the perturbation expansion for the energy. This method has thus far been applied to central force 
problems, incljyding quasibound states for which the complex eigenenergy represents both the location and width of 
the resonanceJl^ 

The linear algebraic version of dimensional perturbation theory is also well suited to parallel computation. Here 
we demonstrate this for a prototype problem with two degrees of freedom, the hydrogen atom in a magnetic field. 
This system has received much attention; it exhibits chaotic behavior and poses difficulties that have challenged many 
theoretical techniques. Most theoretical approaches treat either the magnetic field or the Coulomb potential as a 
perturbation and therefore work best near either the low- or high-field limit, respectively. However, the leading terms 
of a perturbation expansion in inverse powers of D include major portions of the nonseparable interactions in all field 
strengths. The efficacy of methods equivalent to the 1/D expansion has been demonstrated for the hydrogen atom in 



a magnetic fieldlil and for kindred problems with an electric field or crossed electric and magnetic fields,ll3 although 
not in formulations suited to parallel computation. The present paper is devoted solely to implementing of the linear 
algebraic method on the Connection Machine, and to evaluating the performance of the computational algorithm as 
well as prospects for treating systems with more degrees of freedom. Numerical results-|for the ground and several 
excited states over a wide range of field strengths will be presented in a separate paper Jl3 

II. THEORY 

The Schrodinger equation for a nonrelativistic hydrogenic atom in a uniform magnetic field along the z axis, in 
atomic units and cylindrical coordinates, is given by 
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Here ra is the azimuthal quantum number and the magnetic field B is measured in units m'le^c/fc' ~ 2.35 x 10^ G.O 
We can eliminate the first derivative term with the substitution 



ci>(p,z) = pi/2^(p,z), (2) 



which gives 



1 ^ 32 , 32 ^ , m^ _ 1 „j5 52^2 z 



2 \dp^ dz^J 2p2 2 8 ^p2 + ^2 



Mp,z)=E^p,z). (3) 



The two good quantum numbers are m and the z parity |4C2, although the observation of exponentially small level 
anticrossings suggests a "hidden" approximate symmctry.Ej The trivial (i.e. diagonal in each m^ subspace) mB/2 
term may be dropped, to be added at the end of the calculation. 

Now we consider the more general case where p is the radius of a {D ~ l)-dimensional hypersphere (with D — 2 
remaining angles), so that the total number of spatial dimensions is D when we include the component z parallel to 
the magnetic field. In a manner completely analogous to that described above, we may eliminate the first derivative 
in the radial part of the Laplacian, 
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by introducing the substitution 



$(p,z)-p(^-2)/2vl/(p,^), (4) 

which leads to the same form as Eq. (^), with m2 — 1/4 replaced by A(A + 1) where A = |m| + i(I? — 4).Ea As before, we 
drop the mBj2. term. Since the resulting equation depends on m and D only through A, we sec that intcrdimcnsional 
degeneraciestJ connect states (|77i|,D) with {\m\ — 1,13 + 2). It proves convenient to define a scaling parameter as 
K = D + 2\m\ — 1 and take A = ^{k — 3). The entire \m\ = 0, 1, 2, . . . spectrum for the real three-dimensional system 
can then be recovered simply by obtaining the solutions for k — 2,4,6, . . ., respectively. 
On introducing dimensionally scaled variables defined by 

p = K^p, z — K^z, B = K^B, e = K E, (5) 

we obtain the dimension-dependent Schrodinger equation 
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where S = 1/k is treated as a continuous perturbation parameter. The same form may he obtained simply by replacing 
\m\ in Eq. (y) by (k — 2)/2; this is the procedure employed by Bender and coworkers.U 
We see that in the (5 — » limit, the problem reduces to finding the minimum of the effective potential 
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Clearly, the effective potential is minimized for z = Zm = 0, so we are left with the straightforward problem of 
minimizing a function of one variable to determine pm and V„o:{pm,Zm)- In the zero-field limit, pm = (4^)~^ and 
V^ff(/5m, Zm) = —2Z^ in the scaled energy units of Eq. (||); in unsealed units this becomes 
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the correct expression for the ground state energy in each azimuthal manifold. In the strong-field limit, pm ~ B ^" 
and Vg{j(p„i, Zm) ~ B/A, or in unsealed units 
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which is simply the coptinuum threshold in each m^' subspace. This appropriate behavioL-in both limits, which has 
been noted previously,Lil is the motivation behind the definition of k, which we have used.LJ 

For large but finite k, the system will undergo harmonic oscillations about this minimum, so we introduce dimension- 
scaled displacement coordinates xi and X2 through 



P = Pm + (5^/^X1, 
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Substituting into Eq. (^) leads to 
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We next expand the wavefunction and energy in Eq. (O) as 
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$(a;i,a;2) = ^ <I>j(a:;i, xa)^-'/ 
and collect powers of S^''^ to obtain an infinite set of inhomogeneous differential equations. 
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with e2i+i = 0. The case p = corresponds to a pair of independent harmonic oscillators, with the solution 

Co = {vi + i) Wi + (j/2 + i) CJ2 + wo,o, (16) 

$0(2^1, 2:2) = /),,.i(wy^a;i)ft,^2(w2 a;2), (17) 

where vi and z^2 are quantum numbers for the normal modes corresponding to motion perpendicular to and parallel 
to the magnetic field, respectively, and the hi are the one-dimensional harmonic oscillator eigenfunctions. The cor- 
respondences between these large-dimension quantum numbers and the conventional labels for the low, intermediate, 
and high field cases will be detailed elsewhere.Ej 

III. TENSOR METHOD 



The higher-order terms in the energy and wavefunction expansions may be computed using the linear algebraic 
method.u The wavefunction expansion terms $j (xi , X2 ) are expanded in terms of the harmonic oscillator eigenfunctions 
hi. 



$j(xi,X2) =X!X! *''*''«J^il('^l 2;i)/li2(t^2 2:2), 
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so that the representation slj is a tensor of rank 2. The displacement coordinates Xi are represented in this basis by 
the matrix 
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and the Hamiltonian 7ij>o is represented as a linear combination of direct (outer) products of these matrices, namely 
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Using these representations, the j — term of Eq. (^4), {Tio — £o)^p, becomes 

(Ho - eoI)ap = ((^iKi) (tj2K2))ap, 



(20) 



(21) 



where K^ is a diagonal matrix with element {j,j) equal to j — Vi, since the basis functions hi-^ioj^ a;i)/ii2('^2 2:2) 
are eigenfunctions of TLq. Let us pause at this point to be sure that there is no confusion regarding the notation. A 
"direct sum" is analogous to the direct product operation, i.e. C = A B, where A and B are matrices, gives a 
rank 4 tensor with elements Cj-^kij2k2 — ^jifci + ^j2fe2- Then the multiplication E = CD, where D and E are rank 2 
tensors, implies Ej-^j^ ~ Cj-^kij2k2Dkik2i where we are using the implied summation convention. If we define a rank 4 
tensor K with elements 



1, jl = fci == J^i, J2 = fc2 = 1^2 

^3lki32k2 = { (^l(jl - '^l)+^2{J2 - >^2)y^, jl = ^1,^2 = fe, jl 7^ 1^1 and/or J2 ^ V2 , 

0, otherwise 



(22) 



then it may be verified that K((ci;iKi) (^2X2)) is equal to the direct product I (8) I except for a zero in element 
i'ii^izy2i^2- However, due to the orthogonality condition 



a^a.p = (5o,p, (23) 

where do,p is the Kronecker delta, we have 

K{{LJiKi) ® {uj2K2))ap = Sip, p>0. (24) 

Therefore if we multiply Eq. (n3) on the left by K, we obtain a recursive solution for a.p, 

ap = -K^(H,-e,I)ap_,. (25) 



If we instead multiply Eq. (14) on the left by a^ and use the orthogonahty condition of Eq. (|23|), we obtain an 



expression for ep in terms of a.j<_p, 

p 
Cp^^a^Ujap-j. (26) 
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From a computational standpoint, the only operation with which we need to concern ourselves is the multiplication 
of a rank 4 tensor with a rank 2 tensor, Hja^^j. According to Eq. (p2|), the nonzero elements of K appear in Eq. (pS) 
simply as scaling factors for each element of ap after the summation has been completed. 

IV. IMPLEMENTATION 
A. Rewriting the recursion relations 

In order to solve Eqs. (p5| ) and (|2^) recursively, we introduce the rank 2 tensor A.jin which is defined as 

^,to = (xr''®x2')an, (27) 

so that the Hjap_j term which most concerns us may be expressed as a linear combination of these tensors, namely 

/0+2)/2 \ 

Hjap-j = \ 2^ ''Vjj+2Aj+2,Lp-] \ +VjjAj,o,p-j +Vjj-2Aj-2fi,p-j- (28) 
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In order for this change of notation to be beneficial, we need to find some simple recursion relation(s) relating the 
Ajin tensors for a given order to those tensors which have been used at lower orders. In addition, we hope that only a 
small subset of the tensors used for previous orders are needed, so that a managable number of these tensors (and the 
wavefunction tensors a„) need to be allocated in memory. Two recursion relations are immediately evident, namely 

Aj + iJ^n = (Xl (gl l)Ajln, (29a) 

Aj+2A+l,n = (I «) xl)Ajln- (29b) 

It remains to be seen how many terms may be computed using these relations, and conversely how many will need 
to be computed "from scratch." We show in Fig. |l| the tensors Ajin which appear in the first term of Eq. ( P8|) for 
the first few orders p, a nd al so indicate which recursion relations (if an y) m ay be used to compute each tensor (with 



preference given to Eq. (29a) if bo th ar e applicable). We see that Eq. (29a) simply "shifts" the staircase- like pattern 



of tensors to the right, while Eq. (29b) fills in all but three of the gaps left by this shift. Turning our attention to 
the second term of Eq. (|2^), we note that at order p this term represents tensors which have appeared in the first 
term at order p — 2, with the exception of ^i.o.p-i and ^2,o,p-2- Similarly, the third term of Eq. (ESh at order p 
involves tensors which occured in the second term at order p — 2, with the exception of ^4o,o.p-2, which is simply the 
wavefunction tensor ap_2 which is resident in memory. Therefore, at each order we have at most five tensors which 
cannot be computed from direct application of Eqs. (p9|). 

In reality, the computation of these five tensors is no more difficult than the two standard recursion relations: 



Aifi.p-1 = (xi (X)I)ap_i, 

^3,0,p-l = (Xi (g>I)Ai^o,p-l, 

A3A,p-l = (I (» X^)Ai,o,p-l, (30) 

A2fi,p-2 = (Xi (8)I)ap_2, 

^4,2,p-2 = (I«)X2)[(I«)X2)ap„2]- 

We note that these relations, as well as the two standard recursion relations, only involve computation along one 
coordinate axis. 

By properly ordering the computation steps, we may compute all of the tensors Ajin at order p "in place" (i.e. 
overwriting all of the tensors from order p—1), thus minimizing the need for temporary storage. The second recursion 
relation uses tensors from order p—2, which must be temporarily stored during the order p—1 calculation. However, 
this relation is only used for {p— 2)/2 tensors at order p, so the additional storage cost is minor. The algorithm may 



be summarized as follows (see also Fig. El): (1) Recursion Eq. (29a) is applied to all tensors, working from right to left 
in Fig. H to avoid t he n eed for temporary sto rage . Note that this step does not overwrite those tensors which we will 



need later for Eq. ( 29b ). (2) Recursion Eq. (|29b|) is applied to tensors from order p — 2 which are stored separately. 
By again working from right to left, we may replace the tensors in temporary storage as they are used with the 
corresponding tensors from order p — 1, so that both the main set of tensors and the temporary set are updated as 
they are used. (3) Five original tensors (Eq. (pQ)) are computed. (4) ap is computed by taking a linear combination 
of tensors, and the contributions of those tensors which are necessary for ap+2 and ap+4 are now computed. 

B. Implementing the recursion relations on the Connection Machine 

We have implemented the computation of the energy eigenvalues, as described in the previous section, using a 
data-parallel algorithm on the CM-5 Connection Machine computer. The CM-5 computer consists of up to 16,384 
processor nodes, each consisting of a Sun Sparc processor and four vector units for fast floating-point computation. 
It supports both the data-parallel and the message-passing styles of parallel computation. 

The data-parallel capabilities of the CM-5 are supported in the form of high-level languages: C* is a data-parallel 
extension of the C programming language; while CM Fortran is similar to the Fortran 90 draft standard, augmented 
by some features from High-Performance Fortran (HPF) such as the FDRALL statement. For this work, we chose to 
use CM Fortran. 

The basic parallel data type in CM Fortran is the usual Fortran array. When the CM Fortran compiler encounters 
a DIMENSION statement, it allocates memory to spread the array across the processors of the CM-5. If it is desired 
that certain axes of a multidimensional array be localized to a single processor, that can be arranged by a simple 
compiler directive, called the LAYOUT directive. Axes localized in this way are called serial axes, while those spread 
across the processor array are called news axes. 

Operations on corresponding elements of arrays with identical layouts can then be performed by each processor in 
parallel. If there are more array elements than physical processors, the compiler arranges for each physical processor 
to contain a suhgrid of multiple array elements, and array operations are then multiplexed over these elements as 
required. The compiler's orchestration of this process is completely transparent to the CM Fortran user, who may 
then regard each array element as living in its own virtual processor. 

In this picture, operations involving noncorresponding array elements require interprocessor communication. Data- 
parallel programming languages, such as CM Fortran, support many different varieties of this. For the problem at 
hand, however, only nearest-neighbor communication is used. 

Nearest-neighbor communication is needed when, for example, you would like to take a linear combination of the 
array elements in (virtual) processors, i — 1, i, and i -I- 1. It is supported by a Fortran 90 (and CM Fortran) intrinsic 
function known as CSHIFT. If A is a CM Fortran array, and i and j are integers, then CSHIFT(A,i, j) is another such 
array, whose elements are shifted along axis i by an amount j . There is another variant of this function, known as 
EOSHIFT, that does basically the same thing, differing only in its treatment of the boundary elements of the array; 
whereas CSHIFT treats the boundaries as though the array were periodic, EOSHIFT has extra arguments for arrays of 
codimension one to be inserted at the boundary. 

Our storage method uses the fact that the position matrices in the harmonic oscillator representation, given by 
Eq. (|l9|), are doubly banded. We store only the nonzero elements of the ith row in (virtual) processor i. Thus, one 
can think of the band below the diagonal as a one-dimensional array, XL, and that above the diagonal as another 
one-dimensional array, XU. Then, XL(i) and XU(i) contain the two nonzero elements in row i, and we can write 

Xq/3 = XL(a)(5Q_i,/3 +VJ{a)5a+i,l3- 



The main operation needed for the implementation of the recursion relations, Eqs. (|29|), is the inner product of the 
position matrices with the matrices, ^j^i^n- The latter are stored as dense three-dimensional arrays, with two news 
axes for the components of the matrices, and one serial axis representing the allowed combinations of j, I, and n. 
Thus, one can imagine these matrices as spread across the processor array, with corresponding components localized 
to the same (virtual) processor. 

Consider one of these matrices; call it A.f_ii, (where the Greek indices no w la bel the tensor components and not the 



serial axis, {j, l,n}, which is being supressed for the moment). Then Eq. (29a), which is the inner product of x with 
(the first component of) A., is 

(x^^ (g) I)A^^ - XL(e)^^-i,^ + xu(0^c+i,^- (31) 

Note that we can implement this in terms of two CSHIFT operations along the first axis of A. Since this must be done 
for all values of /i, we add this instance axis to the arrays, XL and XU. 

The complete algorithm for implementing the inner product then begins by dimensioning the arrays as follows: 

INTEGER m,n 

DIMENSION XL(nmax,nmax) , XU(nmax,nmax) , AA(nmax,nmax) 

The XL and XU arrays are then initialized as follows: 

FORALL(m=l:nmax,ii=l:iimax) XL(m,ii) = SQRT(ii-l) 
FORALL(m=l:nmax,n=l:iimax) XU(m,n) = SQRT(ii) 

where the FORALL statements are parallel DO-loop constructions supported in High-Performance Fortran (and CM 
Fortran). For example, the first such statements cause each (virtual) processor to take its own index, decrement it by 
one, and take the square root. Alternatively, rather than allocate a separate array for XU, we can obtain it from XL 
by a simple EOSHIFT operation. 

The inner product in Eq. ( pT| ) is then taken as follows: 

XA = XL*E0SHIFT(AA,1,-1) + XU*EOSHIFT(AA, 1 ,+1) 



Using this method, the recursion relations, Eqs. ( p9D and (|30|), can be implemented, and the terms in Eq. ( pS]) can 
be multiplied by the scalars, Vij, and summed in place. (Since the {j, Z,n} axis is serial, the arrays Aj^i^n are all 
aligned in the processor array.) Similarly, the sum over j in Eq. (|2g) can be taken (without the premultiplication by 
ao, which is independent of j.) 

Finally, the result for ep, given by Eq. (Ed), can be found quite easily. Since ao is an eigenvector in our representation, 
we can effectively premultiply it by simply taking the first component of ^ H^ap-j. The statement 

ep = XA(1,1) 

reaches into the (virtual) processor containing the (1,1) component of the array XA, pulls out its value, and writes it 
into the scalar (stored on the control processor) quantity ep, from where it can be manipulated, output, etc. 

Thus, the array extensions of Fortran 90 and High-Performance Fortran, as embodied in the CM Fortran language, 
give a convenient set of primitives for the efficient implementation of the dimensional perturbation theory algorithm. 

C. Contraction of axis lengths 

All of the wavefunction tensors a„ and the tensors Ajin used in their computation are allocated to the size of 
the final wavefunction tensor, ap^^^, which permits calculation of energy expansion coefficients through ep,„^^+i. 
However, we have not yet attempted to take advantage of the sparseness of these tensors. To illustrate the amount 
of room for improvement. Fig. S indicates the nonzero elements of the first few wavefunction tensors for the case of 
a ground state (j^i = 1^2 = 0) calculation. Since we would lose much of the efficiency of the routines for the recursion 
relations described above if we attempted to treat these as general sparse tensors, we will instead look for smaller-scale 
improvements . 

First, we see that since X2 only occurs in even powers, alternating columns are always zero and need not be stored. 
In addition to cutting the storage in half, this reduces the two next-nearest neighbor communications required for the 
(I (g) X2) multiplication to two nearest neighbor communications. 

Looking along the other axis, we see that either even or odd rows, but not both in a given tensor, contain nonzero 
elements, so that the rows may be paired up as long as we maintain some sort of flag telling us whether the even 
or odd row of the pair is represented. This also carries an extra communication benefit in addition to the storage 
improvement; one of the two nearest-neighbor communications in the (xi (g)I) multiplication is converted to an entirely 
local operation. 



V. PERFORMANCE 

We now turn to the mapping of these arrays to the individual processors on the Connection Mach ine, in the typical 



case where the array size exceeds the number of processors. The recursion relation of Eq. (29a) may be ap plied 
independently to different columns since the communication is entirely within columns. Similarly, Eq. ( 29b ) may 
operate independently on separate rows. Since the former recursion relation is applied much more often than the 
latter, we would expect that the optimal array layout would be that in which each processor operates on a small 
number of long column segments, i.e. the first (row) axis is "more serial" while the second (column) axis is "more 
parallel." 

We can in fact obtain a quantitative measure of this balance. Suppose we have an N x N matrix to be allocated 
on a system of M processors. If we assign n columns (or segments of columns) to each processor, then the length of 
each of these column segments is N^/nM, as shown in Fig. H. The first recursion relation may be implemented in 
CM Fortran as described in section IV. B, with the addition of an IF clause due to the contraction of the first axis, 
as described in the previous section. Using appropriately defined arrays XO and XLU, this may be accomplished as 
follows: 

IF (odd rows nonzero) THEN 

XA = XO*AA + XLU*E0SHIFT(AA,1,-1) 
ELSE 

XA = XO*AA + EDSHIFT(XLU*AA,1,+1) 
END IF 

Whichever branch of this clause is taken, there are three arithmetic operations on each array element, with a total 
cost of 3iV^iyi/M, where tA represents the time for a single arithmetic operation. The EOSHIFT operation will move 
n elements from any given processor into an adjacent processor, while the remaining N"^ /M — n elements require an 
on-processor move (except for the case n — N/M, where all elements are moved on-processor). Thus the EOSHIFT 
time is given by 

ntQ if n = N/M 

ntq + I -^ — n ) Im otherwise ' 

where tq represents the queue waiting time for interprocessor communication and Im is the on-processor move time. 
The second recursion relation can be effected by the CM Fortran statement 

XA = X20*AA + EOSHIFT (X2LU*AA, 2,-1) + X2LU*E0SHIFT(AA,2,+1) 

where the definition of the constant arrays X20 and X2LU is dependent on the contraction of the second axis, as 
described in the previous section. The arithmetic cost is ^N'^MIa/M and the cost of the two EOSHIFT operations is 



2N' 
nM 



tQ + 2(^ ~ ^jtM Otherwise 



For a program which applies the first recursion relation ni times and the second relation ?i2 times, the total time 
incurred by the two relations is 

trecur = Tr[(^"'i + ^^2)*^ + ("-1 + 2n2)tAf] + f "-i"- H j^ ) {tq ~ <m), (32) 

where we have neglected the two special cases for n described above. The first term is independent of the specific 
layout, described by the parameter n. Since the only interprocessor data motion in the entire program arises from the 
recursion relations, the second term can be used to determine the optimal array layout. Since tg > Im, we want to 
minimize the term nin + 2n2N^ /nM. Setting its derivative with respect to n equal to zero gives a simple expression 
for the optimal layout parameter Uopt, 



2n2JV2 

For example, a 30th order calculation using the implementation discussed here requires N = 128, rii = 18441, 
ri2 = 925. A 64-processor CM-5 contains 256 vector units (hence M = 256). Thus, we would estimate Uopt ~ 2.5, 



in agreement with the empirical observation that n — 2 results in faster run times than either n = lorn = 4ona 
64-processor CM-5 (see Table I). Special care must be paid to the special case n = NlM when M < N, for a layout 
with the first axis entirely serial {n — 1) may be optimal despite the fact that Eq. ( p3| ) gives a larger value for Uopt- 

Table I presents timings and the corresponding floating point rates for 20th and 30th order calculations. The 
recursion relations involve substantial interprocessor communication; for the problem sizes considered here, we have 
found that approximately 40% of the CM execution time is spent performing arithmetic, the remainder being consumed 
by communication operations. 

In spite of this, we were abla to obtain a performance of nearly 20 Mflops per CM-5 processor node for the largest 
subgrid sizes considered here.EJ Since the algorithm described here is fully scalable, we expect that if the subgrid size 
was kept fixed and the problem was ported to the largest CM-5's existing today (1024 processor nodes), the total 
performance would be about 20 Gflops. 

Keeping the subgrid size fixed, however, implies that the problem size grows with the hardware. Again referring to 
Table I, we see that if we move a fixed-size problem to larger machines, the per-processor performance degrades. This 
is because the correspondingly decreasing subgrid size means that a larger fraction of time is being spent on latencies 
(overhead costs). 

Thus, massively parallel implementations of this algorithm will be useful if there is something to be gained by going 
to larger problem sizes. The most obvious way to do this is to increase the order of the perturbation series, and hence 
the sizes of the arrays involved, the number of recursion relations that must be used, and the accuracy of the results. 
Here, however, we run into a problem. Table II shows that at 30th order we have exhausted (or nearly exhausted) 
the significant digits of the IEEE standard double-precision floating-point numbers used in the calculation. Thus, to 
grow the problem in this direction, we would need to employ quadruple-precision arithmetic. 

There are, however, other options for growing the problem size in a useful manner. Table III shows results for several 
different magnetic field strengths. Indeed, one of the more interesting things to study in chaotic systems of this sort is 
the continuous dependence of the energy levels on the magnetic field strength, or other system parameter. Variation 
of these parameters provides another dimension over which to parallelize, and thereby maintain large subgrid sizes. 

VI. CONCLUSIONS 

In Table II we give ground state expansion coefhcients e2j for two field strengths, B — I and 1000, computed 
in double-precision arithmetic. Comparison with quadruple-precision calculations shows that the accumulation of 
roundoff error causes a linearly decreasing number of significant digits in the coefhcients, most pronounced for those 
series with the slowest rate of growth in the coefficients. ■— ■ 

In their work. Bender, Mlodinow, and Papanicolaou applied Shanks extrapolation to accelerate convergence .liil For 
the higher-fOrder series we have computed, the large order divergent behavior due to singularities renders this method 
incffective.113 Consequently, we employ Pade approximants to sum our serieS|— Table III compares our 1/k resiilts for 
the ground state of the m — and —1 mainfolds with variational calculationstj and lower and upper bounds^ where 
available. As noted by Goodson and Watson, the order at which roundoff error leaves no significant digits in the 
expansion coefficient e2j may be determined by noting qualitative changes in the root and ratio testsE We also give 
in Table III these maximurn-orders which are attainable using double-precision arithmetic. Results for excited states 
will be presented elsewhere.EJ 

For general problems with t degrees of freedom, the a„ and An objects are tensors of rank t. It is clear that 
storing all of the tensor elements will rapidly become impractical; while a 128 x 128 array of double precision numbers 
consumes 131 kBytes of memory, a 128 x 128 x 128 x 128 object requires 2 GB. Even using a parallel I/O device 
such as the DataVault or Scalable Disk Array for temporary storage, it is evident that it will become necessary to 
incorporate the sparse nature of these data structures more explicitly and to reduce the communication (especially 
to external devices) at the exaense of additional arithmetic wherever possible. 

It has recently been shownEl that the recursion relations may be reordered so that only the a„ tensors need to be 
stored. This approach has been used in the computation of expansion coefficients for the helium atom, a system with 
three degrees of freedom. Numerical estimates of the storage and computational costs associated with this approach 
indicate that large order calculations (20th order or higher) are at present restricted to at most six degrees of freedom, 
e.g. a three-electron atom.EI However, a promising approach may be to combine low-order pepturbation expansions 
with other methods. In particular, the recent development of expansions about the D ^ limitE3 may provide means 
to augment the Z? — > oo results. Expansions about both of these limits can be enhanced by renormalization schemes. 
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Architecture P n CM time (sec) MFLOPS/sec/node 

CM-5 (VU) 

CM-2 
CM-5 (VU) 



p 




n 




CM time (sec) 




20th 


ORDER 


CALCULATION 


(N= 


=64) 




32 




2 








0.896 


128 




2 








0.730 




30th 


ORDER 


CALCULATION 


(N= 


= 128) 




256 




1 








11.91 


512 




1 








7.11 


4 




1 








22.78 


32 




1 








4.62 


64 




1 








3.81 


64 




2 








3.68 


64 




4 








3.75 


128 




2 








2.82 



5.012 


1.536 


0.588 


0.492 


19.668 


12.124 


7.608 


4.964 



TABLE I. Performance results on Connection Machines with P nodes, where the optimal array layout is described by n 
(see text). For the CM-2 using the slicewise execution model, each floating-point processor node is comprised of 32 bit-serial 
processors and a floating-point accelerator chip, so a full 64K CM-2 has 2048 processor nodes {M — P). For the CM-5, each 
processing node (PN) has four vector units, so Af = 4P in predicting performance. 
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£2, {B = 1) 



e2j (B = 1000) 



-1 



1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 



-1.577218587578393 

0.63293278553615i 
-0.3281655376303i 

0.189180754067i 
-0.12027751042 

O.IO22IO9I4 
-0.16855886 

0.465347 
-1.4863 

4.926 
-1.68(1) 

l-o(2) 



1.910051627706109(3) 

3.617659493467253(2) 
-1.617557954768096(3) 

8.71028965022799i(3) 
-5.85646941347957(4) 

4.5181691716960 (5) 
-2.552533124276(6) 
-4.755986149379(7) 

3.368118837759(9) 
-1.42851561625(11) 

5.37280298577(12) 
-1.876673877s (14) 

5.8348070582(15) 
-1.29765909o(17) 
-1.39749835o(18) 

4.76063918(20) 
-4.5262663o(22) 

3.3657464(24) 
-2.1791486(26) 

1.219808(28) 
-5.17984o(29) 

3.7664o(30) 

2.6805i(33) 
-4.3958(35) 

5.073i(37) 
-4.924(39) 

4.097(41) 
-2.68(43) 

7.86(44) 

1.5(47) 
-4.0(49) 



TABLE II. Expansion coefficients e2j for the ground-state energy, defined in Eq.(|l2), for field strengtiis B = 1 and 1000. 
Note that due to the scaling of Eq.(|5|), e should be divided by (D — 1)^ = 4 to give the rra = energy E. Subscripts specify the 
digit after the last digit which is expected to be significant, and the number in parenthesis following each entry is the power of 
10 multiplying that entry. 
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B 


order 


Present work 




Rosner et al. 


20 


Handy et al 


21 










Eb 


Iso state 






0.1 


7 


0.54752648 






0.547527 




1 


11 


0.831169 






0.83116C 


) 




2 


12 


1.022213 






1.022214 1.0222138 < Eb < 


1.0222142 


20 


20 


2.21539 






2.21539C 


) 2.215325 < Eb < 


2.215450 


200 


24 


4.7275 






4.727 


4.710 < Eb < 


4.740 


300 


25 


5.362 






5.361 


5.34 <Eb < 


5.39 


1000 


30 


7.67 


Eb, 


2p- 


7.662 
-1 state 


7.55 < Eb < 


7.85 


0.1 


10 


0.20084567 






0.2008456 




1 


16 


0.4565971 






0.456597 




2 


19 


0.599613 






0.59961C 


i 




20 


25 


1.46551 






1.4655 






200 


33 


3.3473 






3.3471 






300 


35 


3.83485 






3.8346 






1000 


38 


5.640 






5.63842 







TABLE III. Comparison of binding energies, Eb = S(|?^| + l)/2 — E, of the lowest enjesgy states of the m = and —1 
majiifolds obtained from the present 1/k expansion, variational calculations of Rosner et a/.,Ej and energy bounds of Handy et 
alEj Also given are estimates of the maximum perturbation order for which double precision arithmetic suppresses roundoff 
error sufficiently to yield the final coefficient with at least one significant figure. 
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Figures 



Fig. 1. J[.ji„ matrices arising from the first term in Eq.(|2g), for the first few orders p. Superscripts indicate which 
recursion relation, if any, of Eq.(p3) may be used to compute that matrix, with preference given to Eq.(29a). 

Fig. 2. Algorithm for applying recursion relations, using order p = 8 as an example. Displayed is the state of the 
matrices An n af ter each of the following steps: (a) status after order p ~ I; (b) apply relation Eq.(29a); (c) apply 
relation Eq.(29a); (d) compute three new elements. Open and filled circles denote matrices from order p — 1 and p, 
respectively. 

Fig. 3. Nonzero elements of the first few wavefunction expansion matrices a.p for the ground state case. 

Fig. 4. Layout of an iV x iV matrix on M processors. 
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FIG. 1. 
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